This report analyzes 46 IPV (Intimate Partner Violence) detection experiments using Large Language Models, examining 14,790 narratives. The analysis reveals strong overall accuracy (93.1%) but a critical shortfall in recall (64.2%).
Key Performance Indicators:
🎯 Main Finding: The model exhibits high false negative rates, meaning it misses actual IPV cases. This is the primary area requiring attention through prompt engineering and threshold adjustment.
The primary model mlx-community/gpt-oss-120b was used in 36 experiments (94.7% of completed runs).
⚠️ Performance Gap Identified:
The recall score of 64.2% is significantly lower than accuracy (93.1%), indicating the model misses approximately 35.8% of actual IPV cases. This is a critical issue for a detection system where false negatives have serious consequences.
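The gap between accuracy and recall is typical when the positive class is rare. The sketch below illustrates the mechanism with hypothetical confusion-matrix counts chosen only to reproduce the report's headline accuracy and recall; the actual per-cell counts live in the database and are not stated in this report.

```python
# Illustrative only: hypothetical counts (tp, fp, fn, tn) chosen to match
# the reported 93.1% accuracy and 64.2% recall. Real counts are in the DB.
def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)        # share of actual IPV cases detected
    precision = tp / (tp + fp)     # share of positive calls that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# With ~10% positives, many true negatives mask the missed positives:
acc, rec, prec, f1 = metrics(tp=642, fp=332, fn=358, tn=8668)
print(f"accuracy={acc:.1%} recall={rec:.1%}")
```

Because true negatives dominate the denominator of accuracy but do not enter recall at all, a model can look strong on accuracy while missing more than a third of actual cases.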
Confusion Matrix Breakdown:
Error Analysis:
The best performing prompt version is v0.3.2_indicators with an F1 score of 0.784.
Average prompt length: 2,636 characters
📊 Optimal Configuration: Temperature range 0.1-0.25 achieves the best F1 score of 0.687.
Runtime Statistics:
Best Performing Setup:
This configuration should serve as the baseline for future experiments.
The 64.2% recall rate means 35.8% of actual IPV cases are missed.
Immediate Actions:
- Expand IPV indicator examples in prompts (especially subtle cases)
- Add more diverse relationship violence scenarios
- Consider lowering the confidence threshold for positive detection
- Review false negative cases for pattern identification
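The threshold recommendation can be sketched as follows. The scores and labels below are hypothetical placeholders (the real confidence values come from the model's output, which this report does not list); the point is only the precision/recall trade-off when the cutoff is lowered.

```python
# Sketch of confidence-threshold adjustment on hypothetical data.
def recall_precision_at(threshold, scores, labels):
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    fp = sum(p and (not y) for p, y in zip(preds, labels))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical model confidences and gold labels (1 = IPV present):
scores = [0.95, 0.80, 0.55, 0.40, 0.38, 0.30, 0.85, 0.10]
labels = [1,    1,    1,    1,    0,    0,    0,    0]

strict = recall_precision_at(0.50, scores, labels)  # misses the 0.40 case
loose = recall_precision_at(0.35, scores, labels)   # catches it, at some precision cost
print(strict, loose)
```

Lowering the cutoff converts some false negatives into true positives while admitting more false positives; for an IPV screener where missed cases carry the serious consequences, that trade is usually worth evaluating explicitly.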
Priority testing queue:
Data Completeness:
⚠️ 2 Failed Experiments - Review error logs for root cause
⚠️ 3 Stalled Experiments - May need manual intervention
ℹ️ 3 Running Experiments - Wait for completion before final analysis
⚠️ 3 Accuracy Outliers Detected (>2σ from mean)
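The >2σ rule used here can be expressed compactly. The accuracy values below are hypothetical placeholders, not the report's per-experiment numbers, and are chosen so that one run clearly falls outside two standard deviations.

```python
# Sketch of the >2σ accuracy-outlier rule on hypothetical per-run values.
from statistics import mean, stdev

def outliers(values, k=2.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

accs = [0.93, 0.94, 0.92, 0.95, 0.93, 0.78, 0.94, 0.93]  # placeholder runs
flagged = outliers(accs)
print(flagged)
```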
- Test qwen/qwen3-next-80b new prompt 2025-10-03: 78.0%
- Test qwen/qwen3-next-80b with modified prompt 2025-10-03: 84.7%
- Test GPT-OSS-120B Baseline: 100.0%

With an average F1 score of 0.681, the system demonstrates moderate performance. However, the 64.2% recall rate indicates significant room for improvement in detecting actual IPV cases.
Consistency: MODERATE (F1 σ = 0.086)
This moderate variation across experiments warrants investigating its sources, such as prompt version and sampling temperature.
Error Profile: BALANCED
False positives and false negatives occur at roughly comparable rates.
Expected Impact of Recommendations:
If recall can be improved to 80% while maintaining current precision:
- New F1 Score: ~0.78 (current: 0.681)
- Reduction in missed cases: ~40%
- Overall system reliability: substantially improved
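The projection can be checked from the averages stated in this report alone. Inverting F1 = 2pr/(p + r) for precision given the average F1 (0.681) and recall (0.642), then recomputing F1 at recall 0.80, gives roughly 0.76; the report's ~0.78 estimate is in the same range and may rest on per-experiment precision figures not shown here.

```python
# Back-of-envelope check of the projected F1, using only the report's averages.
def f1(p, r):
    return 2 * p * r / (p + r)

f1_now, r_now = 0.681, 0.642
# Invert F1 = 2pr/(p+r) for p:  p = F1*r / (2r - F1)
p_implied = f1_now * r_now / (2 * r_now - f1_now)
f1_projected = f1(p_implied, 0.80)
print(f"implied precision ~ {p_implied:.3f}, projected F1 ~ {f1_projected:.2f}")
```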
Database Connection:
- Host: memini.lan
- Port: 5433
- Database: postgres
- Report Generated: 2025-10-05 11:55:13 EDT
Analysis Parameters:
- Total Experiments Analyzed: 46
- Completed Experiments: 38
- Total Narratives: 14,790
- Date Range: 2025-10-03 to 2025-10-04
This report was automatically generated from PostgreSQL experimental data. For questions or issues, contact the research team or review the source tables: experiments and narrative_results.